Celestin Apprentice 4

Text File | 1993-04-09 | 2.5 KB | 68 lines | [TEXT/KEEN]

#$FrequencyWord: print sorted list of words, together with number
#of times each word occurs. Use with any input option,
#select "Show stdout" since that's where the output goes.
#This program prints the words in order by frequency of occurrence,
#whereas $WordFrequency prints them alphabetically.
#The file "common words" in the "hAWK programs" folder contains
#a list of words to skip. To do a better job, you can create a
#custom list - for example, the word "while" can be skipped in
#ordinary text, but should be included if the text deals with
#C or hAWK programming. If this file is missing, the program
#will still run, but common words will not be skipped (this
#uses more memory, and runs slower).
#Tech note: the words in “common words” are loaded into the
#(associative) array “common[]”; as with all arrays in hAWK,
#retrieval of an element is done with a hash table, so retrieval
#of an element given the index or checking for the existence
#of an index with the “in” operator is very fast. Thus there would
#be no real advantage to keeping the common words in alphabetical
#order. Also, duplicate words cause no problems.
#This isn't perfect, but is very useful as-is. It's a simple
#program, one you can tinker with easily - try it out on
#some small files, and refinements will suggest themselves.
# User’s Manual references:
# «hAWK User’s Manual» «F Running hAWK programs»
# «hAWK User’s Manual» «L 5 Regular expressions»
# «hAWK User’s Manual» «M 5 Built-in string and file functions»
# «hAWK User’s Manual» «K 4 Built-in variables»
# «hAWK User’s Manual» «K 8 Arrays»
# «hAWK User’s Manual» «N User-defined functions»
# «hAWK User’s Manual» «P 3 The getline function»
# «hAWK User’s Manual» «O 3 Output into files»
# «hAWK User’s Manual» «Q The hAWK function»

BEGIN {
    #Get list of common words to skip.
    commonfile = STDPATH "Drag_on Modules:hAWK programs:" "common words"
    while (getline < commonfile > 0)
        {
        for ( k = 1; k <= NF; k++)
            common[$k] = 1; #Forces common[$k] to "exist".
        }
    close(commonfile)
    $0 = ""
    ## time_it = 1
    if (time_it == 1)
        print "Starting time", time()
    }

#Remove non-word characters, count words.
    {
    gsub(/[^A-Za-z_0-9$'-]+/, " ")
    #or try gsub(/\W+/, " ") #W == [^A-Z_a-z0-9]
    for ( k = 1; k <= NF; k++)
        {
        if (length($k) > 1 && !($k in common))
            count[$k]++;
        }
    }

END {
    #Sort associative array, and print words with count.
    m = sort(count, ind, "rn")
    for (j = 1; j <= m; ++j)
        print ind[j], "\t\t", count[ind[j]]
    if (time_it == 1)
        print "Finishing time", time()
    }